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Abstract — We present algorithms for the discrete cosine trans- 
form (DCT) and discrete sine transform (DST), of types II 
and III, that achieve a lower count of real multiplications and 
additions than previously published algorithms, without sacrific- 
ing numerical accuracy. Asymptotically, the operation count is 
reduced from 2TVlog 2 TV + O(N) to ^TVlog 2 TV + O(N) for a 
power-of-two transform size TV. Furthermore, we show that an 
additional TV multiplications may be saved by a certain rescaling 
of the inputs or outputs, generalizing a well-known technique for 
TV = 8 by Arai et al. These results are derived by considering 
the DCT to be a special case of a DFT of length 4 TV, with 
certain symmetries, and then pruning redundant operations from 
a recent improved fast Fourier transform algorithm (based on a 
recursive rescaling of the conjugate-pair split radix algorithm). 
The improved algorithms for the DCT-III, DST-II, and DST-III 
follow immediately from the improved count for the DCT-II. 

Index Terms — discrete cosine transform; fast Fourier trans- 
form; arithmetic complexity 

I. Introduction 

In this paper, we describe recursive algorithms for the type- 
II and type-Ill discrete cosine and sine transforms (DCT-II and 
DCT-III, and also DST-II and DST-III), of power-of-two sizes, 
that require fewer total real additions and multiplications than 
previously published algorithms (with an asymptotic reduction 
of about 5.6%), without sacrificing numerical accuracy. Our 
DCT and DST algorithms are based on a recently published 
fast Fourier transform (FFT) algorithm, which reduced the 
operation count for the discrete Fourier transform (DFT) 
compared to the previous-best split-radix algorithm [1]. The 
gains in this new FFT algorithm, and consequently in the 
new DCTs and DSTs, stem from a recursive rescaling of 
the internal multiplicative factors within an algorithm called a 
"conjugate-pair" split-radix FFT [2]-[5] so as to simplify some 
of the multiplications. In order to derive a DCT algorithm from 
this FFT, we simply consider the DCT-II to be a special case 
of a DFT with real input of a certain even symmetty, and 
discard the redundant operations [1], [6]-[10]. The DCT-III, 
DST-II, and DST-III have identical operation counts to the 
DCT-II of the same size, since the algorithms are related by 
simple transpositions, permutations, and sign flips [1 1] — [13]. 

Since 1968, the lowest total count of real additions and 
multiplications, herein called "flops" (floating-point opera- 
tions), for the DFT of a power-of-two size TV = 2 m was 
achieved by the split-radix algorithm, with 4TV log 2 TV— 6TV+8 
flops for TV > 1 [6], [8], [14]-[16]. This count was re- 
cently surpassed by new algorithms achieving a flop count 
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of ^TVlog 2 TV + 0(TV) [1], [17]. Similarly, the lowest-known 
flop count for the DCT-II of size TV = 2 m > 1 was previously 
2TV log 2 TV — TV + 2 for a unitary normalization (with the 
additive constant depending on normalization) [6], [7], [12], 
[13], [18]— [26], and could be achieved by starting with the 
split-radix FFT and discarding redundant operations [6], [7], 
(Many DCT algorithms with an unreported or larger flop count 
have also been described [27]-[40].) Based on our new FFT 
algorithm, the flop counts for the various DCT types were 
reduced using a code generator [10], [41] that automatically 
pruned the redundant operations from an FFT with a given 
symmetry, but neither an explicit algorithm nor a general 
formula for the flop count were presented except for DCT-I 
[1]. In this paper, we use the same starting point to "manually" 
derive a DCT-II algorithm by pruning redundant operations 
from a real-even FFT, and give the general formula for the 
new flop count (for TV = 2 m > 1): 

yTVlog 2 TV - 1 1n - N log 2 TV 

+ ^(_ 1) io g2 ^ + | (1) 

The first savings over the previous record occur for TV = 16, 
and are summarized in Table I for several TV. We also consider 
the effect of normalization on this flop count: the above count 
was for a unitary transform, but slightly different counts are 
obtained by other choices. Moreover, we show that a further 
TV multiplications can be saved by individually rescaling every 
output of the DCT-II (or input of the DCT-III). In doing so, 
we generalize a result by Arai et al, who showed that eight 
multiplications could be saved by scaling the outputs of a 
DCT-II of size TV = 8 [42], a result commonly applied to 
JPEG compression [43]. 

If we merely wished to show that the DCT-II/III could be 
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computed in ^-N\og 2 N + O(N) operations, we could do 
so by using well-known techniques to re-express the DCT in 
terms of a real-input DFT of length N plus 0(N) pre/post- 
processing operations [29], [30], [33], [44]-[46], and then 
substituting our new FFT that requires ^7Vlog 2 N + O(N) 
operations for real inputs [1]. However, with FFT and DCT 
algorithms, there is great interest in obtaining not only the best 
possible asymptotic constant factor, but also the best possible 
exact count of arithmetic operations (which, for the DCT-II of 
power-of-two sizes, had not changed by even one operation for 
over 20 years). Our result (1) is intended as a new upper bound 
on this (still unknown) minimum exact count, and therefore 
we have done our best with the O(N) terms as well as the 
asymptotic constant. It turns out, in fact, that our algorithm 
to achieve (1) is closely related to well-known algorithms for 
expressing the DCT-II in terms of a real-input FFT of the same 
length, but it does not seem obvious a priori that this is what 
one obtains by pruning our FFT for symmetric data. 

In the following sections, we first review how a DCT-II may 
be expressed as a special case of a DFT, and how the previous 
optimum flop count can be achieved by pruning redundant 
operations from a conjugate-pair split-radix FFT. Then, we 
briefly review the new FFT algorithm presented in [1], and 
derive the new DCT-II algorithm. We follow by considering 
the effect of normalization and scaling. Finally, we consider 
the extension of this algorithm to algorithms for the DCT-III, 
DST-II, and DST-III. We close with some concluding remarks 
about future directions. We emphasize that no proven tight 
lower bound on the DCT-II flop count is currently known, and 
we make no claim that eq. (1) is the lowest possible (although 
we have endeavored not to miss any obvious optimizations). 

II. DCT-II IN TERMS OF DFT 

Various forms of discrete cosine transform have been de- 
fined, corresponding to different boundary conditions on the 
transform [47]. Perhaps the most common form is the type- 
II DCT, used in image compression [43] and many other 
applications. The DCT-II is typically defined as a real, or- 
thogonal (unitary), linear transformation by the formula (for 
k = 0,...,N -1): 
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for N inputs x n and N outputs C", where 5k,o is the 
Kronecker delta (= 1 for k = and = otherwise). However, 
we wish to emphasize in this paper that the DCT-II (and, 
indeed, all types of DCT) can be viewed as special cases of the 
discrete Fourier transform (DFT) with real inputs of a certain 
symmetry, and where only a subset of the outputs need be 
computed. This (well known) viewpoint is fruitful because it 
means that any FFT algorithm for the DFT leads immediately 
to a corresponding fast algorithm for the DCT-II simply by 
discarding the redundant operations [1], [6]-[10]. 

The discrete Fourier transform of size N is defined by 
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Fig. 1. A DCT-II of size N = 8 (open dots, xq, . . . ,x*j) is equivalent to 
a size-4iV DFT via interleaving with zeros (black dots) and extending in an 
even (squares) periodic (gray) sequence. 



where u>n = e~~ is an jVth primitive root of unity. In 
order to relate this to the DCT-II, it is convenient to choose a 
different normalization for the latter transform: 
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This normalization is not unitary, but it is more directly related 
to the DFT and therefore more convenient for the development 
of algorithms. Of course, any fast algorithm for Ck trivially 
yields a fast algorithm for C", although the exact count of 
required multiplications depends on the normalization, an issue 
we discuss in more detail in section V. 

In order to derive Ck from the DFT formula, one can use 
the identity 2cos(ir£/N) = Jf N + Lj^ 2e to write: 
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where x n is a real-even sequence of length AN (i.e. x^_ n 
x n ), defined as follows for < n < N: 

X2n = %4N-2n-2 = 0, 



2^277+1 — 2J4iV-(277+l) 



(6) 
(7) 



Thus, the DCT-II of size N is precisely the first N outputs of 
a DFT of size AN, of real-even inputs, where the even-indexed 
inputs are zero. 

This is illustrated by an example, for N — 8, in figure 1. 
The eight inputs of the DCT are shown as open dots, which are 
interleaved with zeros (black dots) and extended to an even 
(squares) periodic (gray dots) sequence of length AN = 32 
corresponding to the DFT. (The type-II DCT is distinguished 
from the other DCT types by the fact that it is even about 
both the left and right boundaries of the original data, and the 
symmetry points fall halfway in between pairs of the original 
data points.) We will refer, below, to this figure in order to 
illustrate what happens when an FFT algorithm is applied to 
this real-symmetric zero-interleaved data. 
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Algorithm 1 Standard conjugate-pair split-radix FFT algo- 
rithm of size N. Special-case optimizations for fc = and 
fc = N/ 8, as well as the base cases, are omitted for simplicity. 

function Xk=o..N-i <— splitfft^ (x n ): 

U k2 =0... N/2-1 <- Splitfftjv/2 (X 2 n 2 ) 
Z k 4 =0... N/4-1 SplitfFtjy/4 (X4 n4 + i) 
Z 'k 4 =0 N/4-1 *~ SplitfFtjv/4 (x 4 „ 4 _i) 

for fc = to N/4 - 1 do 

Xk <— Uk + {io^Zk + Lu N k Z' k ) 
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-X"fc4-3JV74 <— Uk+N/4 + * {^N Z k ~ Wjy ' Z 'k) 



L fc+3AT/4 

end for 



III. Conjugate-pair FFT and DCT-II 

Although the previous minimum flop count for DCT-II 
algorithms can be derived from the ordinary split-radix FFT al- 
gorithm [6], [7] (and it can also be derived in other ways [12], 
[13], [18]-[26]), here we will do the same thing using a variant 
dubbed the "conjugate-pair" split-radix FFT. This algorithm 
was originally proposed to reduce the number of flops [2], 
but was later shown to have the same flop count as ordinary 
split-radix [3]-[5]. It turns out, however, that the conjugate- 
pair algorithm exposes symmetries in the multiplicative factors 
that can be exploited to reduce the flop count by an appropriate 
rescaling [1], which we will employ in the following sections. 

A. Conjugate-pair split-radix FFT 

Starting with the DFT of equation (3), the decimation-in- 
time conjugate-pair FFT splits it into three smaller DFTs: one 
of size N/2 of the even-indexed inputs, and two of size AT/4: 

N/2-1 N/4-1 

W A72 X2 ™2 + W 7V E/ w 7V/4 2 ' 4 ™4 + l 
712— n 4 — 

N/4-1 

+ UJ N k </4^4„ 4 -l, (8) 

n 4 =0 

where the indices 4n 4 ± 1 are computed modulo N. [In 
contrast, the ordinary split-radix FFT uses X4„ 4+3 for the 
third sum (a cyclic shift of x 4 „ 4 _i), with a corresponding 
multiplicative "twiddle" factor of cuff.] This decomposition 
is repeated recursively (until base cases of size N = 1 or 
N = 2, not shown, are reached), as shown by the pseudo- 
code in Algorithm 1. Here, we denote the results of the three 
sub-transforms of size N/2, N/4, and N/4 by U k , Z k , and Z' k , 
respectively. The number of flops required by this algorithm, 
after certain simplifications (common subexpression elimina- 
tion and constant folding, and special-case simplifications of 
the constants for fc = and fc = N/8) and not counting 
data-independent operations like the computation of ui^, is 
4N log 2 A^ — 6 A^ + 8, identical to ordinary split radix. 

In the following sections, we will have to exploit further 
simplifications for the case where the inputs x n are real. In 
this case, X k = X*L k (where * denotes complex conjugation) 
and one can save slightly more than half of the flops, both 



Algorithm 2 Standard conjugate-pair split-radix FFT algo- 
rithm of size A" as in Algorithm 1, but specialized for the 
case of real inputs. Special-case optimizations for fc = and 
fc = N / 8, as well as the base cases, are omitted for simplicity. 

function X k = „ N / 2 <— rsplitfft^ (x„): 
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rsplitfftjv/2 (X2n 2 ) 



z k 4 =o...N/s <- rsplitfffcjv/4 (x 4 „ 4+ i) 

Z 'k 4 =0...N/8 <~ rsplitfftjv/4 {X 4n4 -l) 

for fc = to N/8 do 
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end for 



in the ordinary split-radix [7], [48] and in the conjugate-pair 
split-radix [1], by eliminating redundant operations, to achieve 
a flop count of 2A r log 2 A^ — 4A^ + 6. Specifically, one need 
only compute the outputs for < fc < N/2 (where the fc = 
and fc = N/2 outputs are purely real), and the corresponding 
algorithm is shown in Algorithm 2. Again, for simplicity we do 
not show special-case optimizations for fc = and fc = N/8, 
so it may appear that the X N / S , X N / i7 and X 3N / S outputs are 
computed twice. To derive Algorithm 2, one exploits two facts. 
First, the sub-transforms operate on real data and therefore 
have conjugate-symmetric output. Second, the twiddle factors 
in Algorithm 1 have some redundancy because of the identity 



±(JV/4-fc) 
J N 



)% k , which allows us to share twiddle factors 



between fc and A^/4 — fc in the original loop. 

B. Fast DCT-II via split-radix FFT 

Before we derive our fast DCT-II with a reduced flop 
count, we first derive a DCT-II algorithm with the same flop 
count as previously published algorithms, but starting with the 
conjugate-pair split-radix algorithm. This algorithm will then 
be modified in section IV-B, below, to reduce the number of 
multiplications. 

In eq. (5), we showed that a DCT-II of size N, with inputs 
x n and outputs C k , can be expressed as the first N outputs 
of a DFT of size A^ = 4A^ of real-even inputs x n . [Therefore, 
A^ in eq. (8) above and in the surrounding text are replaced 
below by N.] Now, consider what happens when we evaluate 
this DFT via the conjugate-pair split decomposition in eq. (8) 
and Algorithm 2. In this algorithm, we compute three smaller 
DFTs. First, U k is the DFT of size A^/2 = 2N of the x 2n 
inputs, but these are all zero and so Uk is zero. Second, Z k is 
the DFT of size A^/4 = A^ of the Xi n+ i inputs (where 4n+ 1 
is evaluated modulo N), which by eq. (7) correspond to the 
original data x n by the formula: 

< n < N/2 



Vn 



X4n+1 



X2r, 



X 2N -l-2n N/2 <n<N 



(9) 



where we have denoted them by y n (0 < n < N) for 
convenience. That is, Z k is the real-input DFT of the even- 
indexed elements of x n followed by the odd-indexed elements 
of x n in reverse order. For example, this is shown for A^ = 8 
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Fig. 2. The DCT-II oi N = 8 points x n (open dots) is computed, in a 
conjugate-pair FFT of the N = 32 extended data x n (squares) from figure 1, 
via the DFT of the circled points X4 n +i, corresponding to the even- 
indexed x n followed by the odd-indexed x n . 



Algorithm 3 Fast DCT-II algorithm, matching previous best 
flop count, derived from Algorithm 2 by discarding redundant 
operations. 

function Ck=o..N-i <— splitdctll^ (x n ): 
for n = to N/2 - 1 do 

VN-l-n %2n+l 

end for 

Zk=o...N/2 <- rsplitfftjv (y n ) 
Cq <— 2Zq 

for k = 1 to AT/2 - 1 do 

Cfe <— 23? (u 4N Zk) 

Cn-u < 23 (w AN Zk) 

end for 

CV/2 <— V2Z N /2 



in figure 2 with the circled points corresponding to the y n , 
which are clearly the even-indexed x n followed by the odd- 
indexed x n in reverse. The second DFT, Z' k , of size N/4 = N 
is redundant: it is the DFT of x 4rl -i, but by the even symmetry 
of x n this is equal to ^4(_ n )+i, and therefore Z' k = Z k (the 
complex conjugate of Z k ). 

Therefore, a fast DCT-II is obtained merely by computing 
a single real-input DFT of size N to obtain Zk = Z* N _ k , 
and then combining this according to Algorithm 2 to obtain 
Cfc = Xk for k = . . . N — 1. In particular, in the loop of 
Algorithm 2, all but the Xfc and -X"jv/4-fc terms correspond to 
subscripts > N and are not needed. Moreover, we obtain 

X k = u%Z k + w^Zl = 23? (ujl N Z k ) , 

AOv-fc = ~i [oj%Z k - ufztf = -23 (uj k AN Z k ) . 

The resulting algorithm, including the special-case optimiza- 
tions for k = (where lo^ n = 1 and Z is real) and k = N/2 
(where = (1 — i)/V2 and is real), is shown in 

Algorithm 3. 

In fact, Algorithm 3 is equivalent to an algorithm derived 
in a different way by various authors [30], [33] to express a 
DCT-II in terms of a real-input DFT of the same length. Here, 
we see that this algorithm is exactly equivalent to a conjugate- 
pair split-radix FFT of length 4N. Previous authors obtained a 
suboptimal flop count § N log 2 N+0(N) for the DCT-II using 



this algorithm [33] only because they used a suboptimal real- 
input DFT for the subtransform (split-radix being then almost 
unknown). Using a real-input split-radix DFT to compute Z k , 
the flop count for Z k is 2N log 2 ^—4^+6 from above. To get 
the total flop count for Algorithm 3, we need to add A^/2 — 1 
general complex multiplications by 2w^ N (6 flops each) plus 
two real multiplications (2 flops), for a total of 2 N log 2 N — 
N + 2 flops. This matches the best-known flop count in the 
literature (where the +2 can be removed merely by choosing 
a different normalization as discussed in section V). 

IV. New FFT and DCT-II 

In this section, we first review the new FFT algorithm 
introduced in our previous work based on a recursive rescaling 
of the conjugate-pair FFT [1], and then apply it to derive a 
fast DCT-II algorithm as in the previous section. 

A. New FFT 

Based on the conjugate-pair split-radix FFT from section III, 
a new FFT algorithm with a reduced number of flops can 
be derived by scaling the subtransforms [1]. We will not 
reproduce the derivation here, but will simply summarize the 
results. In particular, the original conjugate-pair split-radix 
Algorithm 1 is split into four mutually recursive subroutines, 
each of which has the same split-radix structure but computes 
a DFT scaled by a different factor. These algorithms are 
shown in Algorithm 4 for the case of complex inputs, and 
in Algorithm 5 specialized for real inputs, in both of which 
the scaling factors are combined with the twiddle factors uj^ 
to reduce the total number of multiplications compared to 
Algorithms 1 and 2, respectively. Again, special cases for 
k = and k = N/8, as well as the N = 1,2 base cases 
and obvious simplifications such as common-subexpression 
elimination, are not shown for simplicity. 

The key aspect of these algorithms is the scale factor .k, 
where the subtransforms compute the DFT scaled by ,fc 
for t = 1,2,4. (The t — 0,1,2 subroutines are condensed 
into one in Algorithms 4-5, by making I a variable, but in 
practice they need to be implemented separately in order to 
exploit the special cases of the scale factor [1]. The t = 4 
case is written separately because it is factorized differently.) 
This scale factor is defined for N = 2 m by the following 
recurrence, where fc 4 = k mod ^: 

{1 for N < 4 

s Af/4,fc 4 cos,(2-Kki/N) for fc 4 < N/8 . 
Sn/4 fe 4 sin(27rA;4/iV) otherwise 

(10) 

This definition has the properties: sat, = 1, 5^+^/4 = s N,k, 
and Sjv,iv/4-/s = Also, s^.k > and decays rather 

slowly with N: s N>k has an fi(iV los 4 cos W5)) asymptotic 
lower bound [1]. When these scale factors are combined with 
the twiddle factors w^, we obtain terms of the form 

, ,fc S N/i,k 

tN.k=^ N , (11) 

SjV,fe 

which is always a complex number of the form ±1 ±i tan ^jf- 
or ± cot 22* ± i and can therefore be multiplied with two 
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Algorithm 4 New FFT algorithm [1] of length TV (divisible 
by 4). The sub-transforms newfftS^y (x) for £ ^ are scaled 
by SiN,k, respectively, while £ = is the final unsealed DFT 
(so,k = !)■ Special-case optimizations for k = and k = TV/8 

are not shown. 

function Xk=o..N-i <— newfftS^r {x n )\ 
{computes DFT / s eN ,k for I = 0, 1, 2} 

o 



fc 2 =0...iV/2-l 



Zk 4 =0... N/4-1 
7' 

^fe 4 =0. ..JV/4-1 



newfftSjv/2 (x 2 „ 2 ) 
newfftS 1 , 



JV/4 ( x 4n 4 + l) 

newfftS^/4 (x 4 „ 4 _i) 
for to 7V/4 - 1 do 

Xk ^ Uk + [t-N.kZk +t* N k Z' k ^J ■ (SN,k/SiN,k) 

Xk+N/2 <—Uk— [t-N.kZk + t N .k Z 'k) ' ( s N,k/s£N,k) 
Xk+N/4 



X 



fe+3JV/4 



— « [tN.kZk - t*N,k^kj ' ( s N,k/seN,k+N/i) 
N/4 U k+N /4 

i (tN.kZk — t* N k Z k ^j ■ {SN,k/ StN,k+N/i) 



+ i 
end for 

function Xk=o..N-i *— newfftS4Ar (x n ): 
{computes DFT / S4N,k} 



Uk 2 =0... N/2-1 

Zk 4 =0... N/4-1 
7' 

Zj k 4 =0... N/4-1 



newfftSjv/2 (x 2 n 2 ) 



newfftSl 



Xk+N/2 



N/4 ( x 4n 4 + l) 

newfftSjy/4 (a; 4n4 _i) 
for k = to iV/4 - 1 do 

Uk + {tN.k^k + t*N,k^Cj ' ( s N,k/ s 4N,k) 

Uk — (tN,kZk + t* N k Z kj 

■ {sN.k/ S4N.k+N/2) 

Uk+N/4 - i (tN,kZk ~ t*N,kZ'k) 

■ {SN.k/ S iN . k+N /i) 

Uk+N/4 + i (tN.kZk — t*N,kZ'k) 
{SN.k/ SiN,k+3N/i) 



X 



k+N/4 



Xk+ZN/4 

end for 



fewer real multiplications than are required to multiply by oj n . 
Because of the symmetry Sjv,jv/4-fc — s N.k, it is possible to 
specialize Algorithm 4 for real data in the same way as in 
Algorithm 2, because the scale factor preserves the conjugate 
symmetry Xk = X N _ k of the outputs of all sub transforms 
[I]. 1 

The resulting flop count, for arbitrary complex data x n , is 
then reduced from AN log 2 N - 6N + 8 to 
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TV logo N - 



124 
~27 



N - 2 log 2 N 



9 1 R 

- ^(-l) loS2 N log 2 N + ^(-l) log2 N + 8. (12) 

In particular, assuming that complex multiplications are imple- 
mented as four real multiplications and two real additions, then 
the savings are purely in the number of real multiplications. 

'In Algorithm 5, we have also utilized various symmetries of the scale 
factors, such as s 2 N,N/4-k = s 2Ar,fc+Ar/4> as described in our previous 
work [1], to make explicit the shared multiplicative factors between the 
different terms. 



Algorithm 5 New FFT algorithm of size TV (divisible by 4), as 
in Algorithm 4 but specialized for the case of real inputs. The 
sub-transforms rnewfftS^r (x) for £ ^ are scaled by sgN,k, 
respectively, while £ — is the final unsealed DFT (so,fc = 1)- 
Special-case optimizations for k = and k = TV/8 are not 

shown. 

function Xk=o..N/2 rnewfftS^r (x n ): 
{computes DFT / siN,k for I = 0, 1, 2} 
Uk 2 =o...N/4 <- rnewfftS^ /2 (x 2 „ 2 ) 



Zk 4 =o...N/8 <- rnewfftSjv/4 (z 4 „ 4+ i) 
Z L=o n/8 <~ mewfftS]v /4 {x 4 „ 4 -i) 
for k = to TV/8 do 

Xk <— Uk + [tN.kZk + t*N,k Z kj ' ( s N,k/siN,k) 



x 



k+N/4 



u 



N/4-k 



X 



* [tN.kZk - t*N,k Z 'k) ' ( s N,k/seN,k+N/4) 



N/4-k 
— I 



Un/4-k 



X 



i (tN.kZk — t* N k Z' k ^j ■ {sN,k/ SlN, k+N/4) 



N/2-k 



u* k 



- (tN,kZk + t*N,k Z 'k) ' ( s N,k/seN,k) 

end for 



function TV fc=0 ..A r /2 <— rnewfftS4Ar (x n ): 
{computes DFT / S4Ar,fc} 
=o...n/4 rnewfftSjy/2 

rnewfftSjv/4 (x 4 „ 4+ i) 
rnewfftSjv/4 (a; 4 „ 4 _i) 
for k = 1 to TV/8 do 



U k2 =o...N/4 <- rnewfftS^ /2 (x 2n2 ) 

Zk 4 =0...N/8 
7' 

^k 4 =0...N/8 



Xk 



{sN,k/ S4N,k) 



X 



k+N/4 



Uk + [tN.kZk + t* Nk z' k ^ 

U N/4-k - 1 [tN.kZk - t N M Z k) 
■ (SN,k/ S4N, k+N/4) 



X 



N/4-k 



u, 



N/4-k 



i I tN,kZk 



(sN,k/s4N,N/4-k) 



*JV,k^fc) 



X 



N/2-k 



ut 



[t N ,kZ k +t* Nik Z k) 



end for 



(sN,k/s4N,N/2-k) 



The number M(N) of real multiplications saved over ordinary 
split radix is given by: 



M(TV) = ^TV log 2 TV - pTV + 2 log 2 TV 

2 1 fi 

+ g (-l)' og2 N log 2 N - ^(~1)^ N . (13) 



If the data are purely real, it was shown that M(TV)/2 
multiplications are saved over the corresponding real-input 
split-radix algorithm [1]. These flop counts are to compute the 
original, unsealed definition of the DFT. If one is allowed to 
scale the outputs by any factor desired, then scaling by l/sN,k 
[corresponding to newfftSjy (x n ) in Algorithm 4], saves an 
additional M S (N) — M(N) > multiplications for complex 
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Algorithm 6 New DCT-II algorithm of size TV = 2 m , based on 
discarding redundant operations from the new FFT algorithm 
of size AN, achieving new record flop count. 

function Cfe=o..iv-i <— newdctllw (x n ): 
for n = to N/2 - 1 do 

y n <— x 2n 

VN-l-n <— £2n+l 

end for 

Zk=o..N/2 <- rnewfftS^ (y n ) 
Co <— 2Zo 

for fc = 1 to JV/2 - 1 do 

C fe <— 23? (uJ 4N s N ^Zh) 

C N -k < 23 (^ w s7v,fc^fc) 

end for 

Cat/2 <— \/2Z N /2 



inputs, where: 

2 2n 

M s (JV) = -JVlog 2 iV--iV 

+ ^(-l) loS2 N log 2 ^ - ^(-l) log2 N + 1. (14) 

For real inputs, one similarly saves M$(N)/2 flops for 
rnewfTtS^Y (x„) operating on real x„ in Algorithm 5. 

Although the division by a cosine or sine in the scale factor 
may, at first glance, seem as if it may exacerbate floating- 
point errors, this is not the case. Cooley-Tukey based FFT 
algorithms have 0(y/TogN) root-mean-square error growth, 
and O (log AT) error bounds, in finite-precision floating-point 
arithmetic (assuming that the trigonometric constants are 
precomputed accurately) [49]-[51], and the new FFT is no 
different [1]. The reason for this is straightforward: it never 
adds or subtracts scaled and unsealed values. Instead, wherever 
the original FFT would compute a + b, the new FFT computes 
s ■ (a + b) for some fixed scale factor s. 

B. Fast DCT-II from new FFT 

To derive the new DCT-II algorithm based on the new FFT 
of the previous section, we follow exactly the same process as 
in Sec. III-B. We express the DCT-II of length N in terms of 
a DFT of length 4N, apply the FFT algorithm 5, and discard 
the redundant operations. As before, the Uk sub transform 
is identically zero, Zk is the transform of the even-indexed 
elements of x n followed by the odd-indexed elements in 
reverse order, and Z' k = Z* N _ k is redundant. 

The resulting algorithm is shown in Algorithm 6, and differs 
from the original fast DCT-II of Algorithm 3 in only two ways. 
First, instead of calling the ordinary split-radix (or conjugate- 
pair) FFT for the subtransform, it calls newfftSAr {x n ). 
Second, because this subtransform is scaled by l/sjv,fe, the 
twiddle factor in Algorithm 6 is multiplied by s N _ k . 

To derive the flop count for Algorithm 6, we merely need 
to add the flop count for the subtransform [which saves 
M s (N)/2 = ±N\og 2 N + O(N) flops, from eq. (14), 
compared to ordinary real-input split radix] with the number 
of flops in the loop, where the latter is exactly the same 



as for Algorithm 3 because the products oj^ N s N ^ k can be 
precomputed. Therefore, we save a total of Ms{N)/2 flops 
compared to the previous best flop count of 2N log 2 N—N+2, 
resulting in the flop count of eq. (1). This formula matches the 
flop count that was achieved by an automatic code generator 
in Ref. [1]. 

Because the new DCT algorithm is mathematically merely 
a special set of inputs for the underlying FFT, it will have the 
same favorable logarithmic error bounds as discussed in the 
previous section. 

V. Normalizations 

In the above definition of DCT-II, we use a scale factor 
"2" in front of the summation in order to make the DCT- 
II directly equivalent to the corresponding DFT. But in some 
circumstances, it is useful to multiply by other factors, and dif- 
ferent normalizations lead to slightly different flop counts. For 
example, one could use the unitary normalization from eq. (2), 
which replaces 2 by \J2/N or y/l/N (for k = 0) and requires 
the same number of flops. If one uses the unitary normalization 
multiplied by VN, then one saves two multiplications in the 
k = and k = N/2 terms (in this normalization, C = Z 
and Cjv/2 = Z N / 2 ) for both Algorithm 3 and Algorithm 6. 
(This leads to a commonly quoted 2N log 2 N — N formula 
for the previous flop count.) 

It is also common to compute a DCT-II with scaled out- 
puts, e.g. for the JPEG image-compression standard where 
an arbitrary scaling can be absorbed into a subsequent quan- 
tization step [43], and in this case the scaling can save 8 
multiplications [42] over the 42 flops required for an unitary 
8-point DCT-II. Since our newfftSjy (x n ) attempts to be the 
optimally scaled FFT, we should be able to derive this scaled 
DCT-II by using rnewfftSjv (x n ) /2 for our length-47V DFT 
instead of rnewfftS^ (x), and again pruning the redundant 
operations. Doing so, we obtain an algorithm identical to 
Algorithm 6 except that 2w*l N SN,k is replaced by tiN,k> which 
requires fewer multiplications, and also we now obtain Co = 
Zq and Cjv/2 = Z N / 2 . This algorithm computes Ck/2s4N.k 
instead of Ck- In particular, we save exactly N multiplications 
over Algorithm 6, which matches the result by Ref. [42] 
for N = 8 but generalizes it to all N. This savings of TV 
multiplications for a scaled DCT-II also matches the flop count 
that was achieved by an automatic code generator in Ref. [1]. 

VI. Fast DCT-III, DST-II, and DST-III 

Given any algorithm for the DCT-II, one can immediately 
obtain algorithms for the DCT-III, DST-II, and DST-III with 
exactly the same number of arithmetic operations. In this way, 
any improvement in the DCT-II, such as the one described 
in this paper, immediately leads to improved algorithms for 
those transforms and vice versa. In this section, we review the 
equivalence between those transforms. 

A. DCT-III 

To obtain a DCT-III algorithm from a DCT-II algorithm, one 
can exploit two facts. First, the DCT-III, viewed as a matrix, 
is simply the transpose of the DCT-II matrix. Second, any 
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size-2 DCT-II 
2 




C = 2(x Q + Xl ) network c 
it transposition 

S> 




-V2 



Fig. 3. Example of network transposition for a size-2 DCT-II (left). When 
the linear network is transposed (all edges are reversed), the resulting network 
computes the transposed-matrix operation: a size-2 DCT-III (right). 



linear algorithm can be viewed as a linear network, and the 
transpose operation is computed by the network transposition 
of this algorithm [52]. To review, a linear network represents 
the algorithm by a directed acyclic graph, where the edges 
represent multiplications by constants and the vertices repre- 
sent additions of the incoming edges. Network transposition 
simply consists of reversing the direction of every edge. We 
prove below that the transposed network requires the same 
number of additions and multiplications for networks with the 
same number of inputs and outputs, and therefore the DCT-III 
can be computed in the same number of flops as the DCT- 
II by the transposed algorithm. The DCT-III corresponding to 
the transpose of eq. (4) is 



N-l 

71=0 



N 



k 



(15) 



Since the DCT-III is the transpose of the DCT-II, a fast DCT- 
III algorithm is obtained by network transposition of a fast 
DCT-II algorithm. For example, network transposition of a 
size-2 DCT-III is shown in figure. 3. 

A proof that the transposed network requires the same 
number of flops is as follows. Clearly, the number of multipli- 
cations, the number of edges with weight ^ ±1, is unchanged 
by transposition. The number of additions is given by the sum 
of indegree — 1 for all the vertices, except for the N input 
vertices which have indegree zero. That is, for a set V of 
vertices and a set E of edges: 

# adds = N + [indegree^) — 1] 

vev 

= N+\E\-\V\, (16) 

using the fact that the sum of the indegree over all vertices is 
simply the number \E\ of edges. Because the above expression 
is obviously invariant under network transposition as long as 
N is unchanged (i.e. same number of inputs and outputs), the 
number of additions is invariant. 

More explicitly, a fast DCT-III algorithm derived from the 
transpose of our new Algorithm 6 is outlined in Algorithm 7. 
Whereas for the DCT-II we computed the real-input DFT (with 
conjugate-symmetric output) of the rearranged inputs y n and 
then post-processed the complex outputs Z k = Z* N _ k , now 
we do the reverse. We first preprocess the inputs to form 
a complex array Zk = Z* N _ k , then perform a real-output, 
scaled-input DFT to obtain %jk, and finally assign the even and 
odd elements of the result C k from the first and second halves 
of jjk- Without the scale factors, this is equivalent to a well- 
known algorithm to express a DCT-III in terms of a realoutput 



Algorithm 7 New DCT-III algorithm of size N = 2 m , based 
on network transposition of Algorithm 6, with the same flop 
count. 



function D k=a ,, N 
Z *- 2x Q 

for n = 1 to N/2 - 1 do 

Z n < 2ui^Sj^, n (x n — iXN 
Zn-u < Z* n 

end for 

Zm/9 <— V / 2a;jv/2 



newdctllljv (x n ): 



mewfftSi? 5 1 (Z n ) 
1 do 



J N/2 

Vk «- '"^n^^ 

for k = to N/2 

Vk 

-1 <- VN 

end for 



°2fc 



1-k 



DFT of the same size [30], [33]. In order to minimize the 
flop count once the scale factors are included, however, it is 
important that this real-output DFT operate on scaled inputs 
rather than scaled outputs, so it is intrinsically different (trans- 
posed) from Algorithm 4. The real-output (scaled, conjugate- 
symmetric input) version of rnewfFtSjy (x n ) can be derived 
by network transposition of the real-input case (since network 
transposition interchanges inputs and outputs). Equivalently, 
whereas the rnewfftSjv (x n ) was a scaled-output decimation- 
in-time algorithm specialized for real inputs, the transposed 
algorithm rnewfftsU'' 1 (x n ) is a scaled-input decimation-in- 
frequency algorithm specialized for real outputs. We do not list 
rnewfftS^ 1 (x n ) explicitly here, however; our main point 
is simply to establish that an algorithm for the DCT-III with 
exactly the same flop count (1) follows immediately from the 
new DCT-II. 

There are also other ways to derive a DCT-III algorithm 
with the same flop count without using network transposition, 
of course. For example, one can again consider the DCT-III 
to be a real-input DFT of size 47V with certain symmetries 
(different from the DCT-II's symmetries), and prune redundant 
operations from Algorithm 5, resulting in a decimation-in-time 
algorithm. Such a derivation is useful because it allows one 
to derive a scaled-owfpMf DCT-III as well, which turns out to 
be a subroutine of a new DCT-IV algorithm that we describe 
in another manuscript, currently in press [53]. However, this 
derivation is more complicated than the one above, resulting 
in four mutually recursive DCT-III subroutines, and does not 
lower the flop count, so we omit it here. 



B. DST-II and DST-III 

The DST-II is defined, using a unitary normalization, as: 



for k = 1, 



2-8, 



■N-l 
k,N \ ^ 



N 



71=0 



x„ sin 



\n ( 




— nH 




N \ 





(17) 



. , N (not 0, . . . , N — 1). As in section II, it is 
more convenient to develop algorithms for an unnormalized 
variation: 

N-l 



S k = 2 ^2 x n sin 



(18) 
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similar to our normalization of Ck and above. Although 
we could derive fast algorithms for Sk directly by treating it 
as a DFT of length AN with odd symmetry, interleaved with 
zeros, and discarding redundant operations similar to above, it 
turns out there is a simpler technique. The DST-II is exactly 
equivalent to a DCT-II in which the outputs are reversed and 
every other input is multiplied by —1 [11]— [13]: 



Sn- 



N-l 

n=0 



(— l) n x n COS 



(19) 



for k = 0, ... , N— 1. It therefore follows that a DST-II can be 
computed with the same number of flops as a DCT-II of the 
same size, assuming that multiplications by —1 are free — the 
reason for this is that sign flips can be absorbed at no cost 
by converting additions into subtractions or vice versa in the 
subsequent algorithmic steps. Therefore, our new flop count 
(1) immediately applies to the DST-II. 

Similarly, a DST-III is given by the transpose of the DST- 
II (which is the inverse, for the unitary normalization). In 
unnormalized form, this is 



N 



x n sin 



71=1 









-0] 







(20) 



for k = 0, . . . , N — 1. This must have the same number of 
flops as the DST-II by the network transposition argument 
above. Alternatively, it can also be obtained from the DCT-III 
by reversing the inputs and multiplying every other output by 
-1 [12]: 



JV-l 



Si = (-if X N-nCOS 









-0] 







(21) 



which can be obtained from Cj with the same number of 



flops. 



VII. Conclusion 



We have shown that a improved count of real additions 
and multiplications (flops), eq. (1), can be obtained for the 
DCT/DST types II/III by pruning redundant operations from 
a recent improved FFT algorithm. We have also shown how 
to save N multiplications by rescaling the outputs of a DCT- 
II (or the inputs of a DCT-III), generalizing a well-known 
technique for N — 8 [42]. However, we do not claim that 
our new count is optimal — it may be that further gains can be 
obtained by, for example, extending the rescaling technique 
of [1] to greater generality. In particular, as has been pointed 
out by other authors [8], any improvement in the arithmetic 
complexity or flop counts for the DFT immediately leads to 
corresponding improvements in DCTs/DSTs, and vice versa. 
Our most important result, we believe, is the fact that there 
suddenly appears to be new room for improvement in problems 
that had seen no reductions in flop counts for many years. 

The question of the minimal operation counts for basic 
linear transformations such as DCTs and DSTs is of longstand- 
ing theoretical interest. The practical impact of a few percent 
improvement in flop counts, admittedly, is less clear because 
computation time on modern hardware is often not dominated 



by arithmetic [10]. Nevertheless, minimal-arithmetic hard- 
coded DCTs of small sizes are often used in audio and 
image compression [43], [47], and the availability of any new 
algorithm with regular structure amenable to implementation 
leads to rich new opportunities for performance optimization. 

Similar arithmetic improvements also occur for other types 
of DCT and DST [1], as well as for the modified DCT (MDCT, 
closely related to a DCT-IV), and the explicit description of 
those algorithms is the subject of another manuscript currently 
in press [53]. We have already shown that the new exact count 
for the DCT-I is 2Nlog 2 N-3N+2log 2 N+5-M(2N)/4 = 
%N\og 2 N + 0(N), where M(N) is given by eq. (13) [1]. 
However, no new exact count for arbitrary N = 2 m has yet 
been published for the DCT-IV and related transforms. Again, 
it immediately follows that the asymptotic cost of a DCT-IV 
(and hence an MDCT) is reduced to ^-N\og 2 N + 0(N) 
simply by applying known algorithms to express a DCT- 
IV in terms of a DCT-II of size N [12] (although this 
algorithm has large numerical errors [10]) or in terms of 
two DCT-II/III transforms of size N/2 [19]. However, again 
we wish to establish a new lowest-known upper bound for 
the exact count of operations of the DCT-IV, and not just 
the asymptotic constant factor. (It turns out that the DCT-IV 
algorithm obtained by pruning redundant computations from 
our new FFT, this time of symmetric data with size 8N [10], 
is closely related to the algorithm by Wang et al. [19], just 
as as the algorithm for the DCT-II in this paper was closely 
related to a known algorithm derived by other means [30], 
[33].) 

This work was supported in part by a grant from the 
MIT Undergraduate Research Opportunities Program. The 
authors are also grateful to M. Frigo, co-author of FFTW with 
S. G. Johnson [10], for many helpful discussions. 
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